Abstract

Automatic organ segmentation is an important yet challenging problem 
for medical image analysis. The pancreas is an abdominal organ with very 
high anatomical variability. This inhibits previous segmentation methods from 
achieving high accuracies, especially compared to other organs such as the 
liver, heart or kidneys. In this paper, we present a probabilistic bottom-up 
approach for pancreas segmentation in abdominal computed tomography (CT) 
scans, using multi-level deep convolutional networks (ConvNets). We propose 
and evaluate several variations of deep ConvNets in the context of hierarchical,
coarse-to-fine classification on image patches and regions, i.e. superpixels.
We first present a dense labeling of local image patches via P-ConvNet and
nearest neighbor fusion. Then we describe a regional ConvNet ($R_1$-ConvNet)
that samples a set of bounding boxes around each image superpixel at different
scales of contexts in a “zoom-out” fashion. Our ConvNets learn to
assign class probabilities for each superpixel region of being pancreas. Last,
we study a stacked $R_2$-ConvNet leveraging the joint space of CT intensities
and the P-ConvNet dense probability maps. Both 3D Gaussian smoothing
and 2D conditional random fields are exploited as structured predictions for
post-processing. We evaluate on CT images of 82 patients in 4-fold
cross-validation. We achieve a Dice Similarity Coefficient of 83.6±6.3% in training
and 71.8±10.7% in testing.

*holger.roth@nih.gov, h.roth@ucl.ac.uk 
1 Introduction

Segmentation of the pancreas can be a prerequisite for computer aided diagnosis
(CADx) systems that provide quantitative organ volume analysis, e.g. for diabetic
patients. Accurate segmentation could also be necessary for computer aided detection
(CADe) methods to detect pancreatic cancer. Automatic segmentation of numerous
organs in computed tomography (CT) scans is well studied with good performance
for organs such as liver, heart or kidneys, where Dice Similarity Coefficients (DSC)
of >90% are typically achieved Wang et al. (2014), Chu et al. (2013), Wolz et al.
(2013), Ling et al. (2008). However, achieving high accuracies in automatic pancreas
segmentation is still a challenging task. The pancreas' shape, size and location in the
abdomen can vary drastically between patients. Visceral fat around the pancreas can
cause large variations in contrast along its boundaries in CT (see Fig. 3). Previous
methods report only 46.6% to 69.1% DSCs Wang et al. (2014), Chu et al. (2013),
Wolz et al. (2013), Farag et al. (2014). Recently, the availability of large annotated
datasets and the accessibility of affordable parallel computing resources via GPUs
have made it feasible to train deep convolutional networks (ConvNets) for image
classification. Great advances in natural image classification have been achieved
Krizhevsky et al. (2012). However, deep ConvNets for semantic image segmentation
have not been well studied Mostajabi et al. (2014). Studies that applied ConvNets
to medical imaging applications also show good promise on detection tasks Cireşan
et al. (2013), Roth et al. (2014). In this paper, we extend and exploit ConvNets for
a challenging organ segmentation problem.


2 Methods 

We present a coarse-to-fine classification scheme with progressive pruning for
pancreas segmentation. Compared with previous top-down multi-atlas registration and
label fusion methods, our models approach the problem in a bottom-up fashion: from
dense labeling of image patches, to regions, and the entire organ. Given an input
abdomen CT, an initial set of superpixel regions is generated by a coarse cascade process
of random forests based pancreas segmentation as proposed by Farag et al. (2014).
These pre-segmented superpixels serve as regional candidates with high sensitivity
(>97%) but low precision. The resulting initial DSC is ~27% on average. Next, we
propose and evaluate several variations of ConvNets for segmentation refinement (or
pruning). A dense local image patch labeling using an axial-coronal-sagittal viewed
patch (P-ConvNet) is employed in a sliding window manner. This generates a
per-location probability response map $P$. A regional ConvNet ($R_1$-ConvNet) samples
a set of bounding boxes covering each image superpixel at multiple spatial scales
in a “zoom-out” fashion Mostajabi et al. (2014), Girshick et al. (2014) and assigns
probabilities of being pancreatic tissue. This means that we not only look at the
close-up view of superpixels, but gradually add more contexts to each candidate
region. $R_1$-ConvNet operates directly on the CT intensity. Finally, a stacked regional
$R_2$-ConvNet is learned to leverage the joint convolutional features of CT intensities
and probability maps $P$. Both 3D Gaussian smoothing and 2D conditional random
fields for structured prediction are exploited as post-processing. Our methods are
evaluated on CT scans of 82 patients in 4-fold cross-validation (rather than
“leave-one-out” evaluation Wang et al. (2014), Chu et al. (2013), Wolz et al. (2013)). We
propose several new ConvNet models and advance the current state-of-the-art
performance to a DSC of 71.8% in testing. To the best of our knowledge, this is the
highest DSC reported in the literature to date.


2.1 Candidate region generation 

We describe a coarse-to-fine pancreas segmentation method employing multi-level
deep ConvNet models. Our hierarchical segmentation method decomposes any input
CT into a set of local image superpixels $S = \{S_1, \ldots, S_N\}$. After evaluation
of several image region generation methods Achanta et al. (2012), we chose entropy
rate Liu et al. (2011) to extract $N$ superpixels on axial slices. This process is based
on the criterion of DSCs given optimal superpixel labels, in part inspired by the
PASCAL semantic segmentation challenge Everingham et al. (2014). The optimal
superpixel labels achieve a DSC upper-bound and are used for supervised learning
below. Next, we use a two-level cascade of random forest (RF) classifiers as in Farag
et al. (2014). We only operate the RF labeling at a low class-probability cut >0.5
which is sufficient to reject the vast amount of non-pancreas superpixels. This retains
a set of superpixels $\{S_{RF}\}$ with high recall (>97%) but low precision. After initial
candidate generation, over-segmentation is expected and observed with low DSCs of
~27%. The optimal superpixel labeling is limited by the ability of superpixels to
capture the true pancreas boundaries at the per-pixel level with $DSC_{max} = 80.5\%$,
but is still much above previous state-of-the-art Wang et al. (2014), Chu et al. (2013),
Wolz et al. (2013), Farag et al. (2014). These superpixel labels are used for assessing
‘positive’ and ‘negative’ superpixel examples for training. Assigning image regions
drastically reduces the amount of ConvNet observations needed per CT volume
compared to a purely patch-based approach and leads to more balanced training data
sets. Our multi-level deep ConvNets will effectively prune the coarse pancreas
over-segmentation to increase the final DSC measurements.
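
To make the candidate-stage figures concrete, the following minimal sketch computes
DSC, recall and precision for two binary masks. It is an illustration only: the function
name `overlap_metrics` and the toy masks are assumptions and are not taken from our
implementation. It shows how an over-segmented candidate region can cover every
pancreas voxel (recall of 1) and still yield a low DSC.

```python
def overlap_metrics(pred, truth):
    """Return (dice, recall, precision) for two equally sized binary masks.

    Masks are flat sequences of 0/1 values, one entry per voxel.
    """
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    n_pred = sum(1 for p in pred if p)
    n_truth = sum(1 for t in truth if t)
    dice = 2.0 * tp / (n_pred + n_truth) if (n_pred + n_truth) else 1.0
    recall = tp / n_truth if n_truth else 1.0
    precision = tp / n_pred if n_pred else 1.0
    return dice, recall, precision


# Toy example (hypothetical numbers): 10 pancreas voxels, all covered by a
# candidate region that also contains 50 non-pancreas voxels.
truth = [1] * 10 + [0] * 90
candidates = [1] * 60 + [0] * 40
dice, recall, precision = overlap_metrics(candidates, truth)
```

Pruning false-positive candidates raises precision, and with it the DSC, while the
recall of the candidate set bounds what any later stage can recover.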
2.2 Convolutional neural network (ConvNet) setup

We use ConvNets with an architecture for binary image classification. Five layers
of convolutional filters compute and aggregate image features. Other layers of the
ConvNets perform max-pooling operations or consist of fully-connected neural
networks. Our ConvNet ends with a final two-way layer with softmax probability for
‘pancreas’ and ‘non-pancreas’ classification (see Fig. 1). The fully-connected layers
are constrained using “DropOut” in order to avoid over-fitting by acting as a
regularizer in training Srivastava et al. (2014). GPU acceleration allows efficient training
(we use cuda-convnet2^1).
Figure 1: The proposed ConvNet architecture. The number of convolutional filters
and neural network connections for each layer are as shown. This architecture is
constant for all ConvNet variations presented in this paper (apart from the number
of input channels): P-ConvNet, $R_1$-ConvNet, and $R_2$-ConvNet.
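
The two ingredients named in Sec. 2.2, a two-way softmax output and DropOut on the
fully-connected layers, can be sketched in a few lines. This is a generic illustration
under stated assumptions: the "inverted" dropout variant, the drop rate of 0.5 and the
example activations are hypothetical and do not describe the cuda-convnet2 internals.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero each unit with probability `rate` while training
    and rescale the survivors so the expected activation is unchanged."""
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)
hidden = [0.2, 1.5, -0.3, 0.8]        # hypothetical fully-connected outputs
dropped = dropout(hidden, rate=0.5, rng=rng)
probs = softmax([0.4, 1.1])           # scores for (non-pancreas, pancreas)
```

At test time the dropout step is disabled (`training=False`), so the softmax output can
be read directly as the class probability used in the following sections.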


2.3 P-ConvNet: Deep patch classification

We use a sliding window approach that extracts 2.5D image patches composed of
axial, coronal and sagittal planes within all voxels of the initial set of superpixel
regions $\{S_{RF}\}$ (see Fig. 3). The resulting ConvNet probabilities are denoted as $P_0$
hereafter. For efficiency reasons, we extract patches every $n$ voxels and then apply
nearest neighbor interpolation. This seems sufficient due to the already high quality
of $P_0$ and the use of overlapping patches to estimate the values at skipped voxels.
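
A minimal sketch of the two steps in Sec. 2.3 follows: extracting the three orthogonal
planes of a 2.5D patch, and filling in skipped voxels by nearest neighbor interpolation.
Plain nested lists stand in for image arrays. The function names, the border clamping
and the toy volume are assumptions made for illustration only.

```python
def extract_25d_patch(volume, z, y, x, half):
    """Return the axial, coronal and sagittal patches centred on voxel (z, y, x).

    `volume` is indexed as volume[z][y][x]; each patch has side 2 * half + 1.
    Coordinates outside the volume are clamped to the nearest valid voxel.
    """
    nz, ny, nx = len(volume), len(volume[0]), len(volume[0][0])

    def at(k, j, i):
        k = min(max(k, 0), nz - 1)
        j = min(max(j, 0), ny - 1)
        i = min(max(i, 0), nx - 1)
        return volume[k][j][i]

    offsets = range(-half, half + 1)
    axial = [[at(z, y + dy, x + dx) for dx in offsets] for dy in offsets]
    coronal = [[at(z + dz, y, x + dx) for dx in offsets] for dz in offsets]
    sagittal = [[at(z + dz, y + dy, x) for dy in offsets] for dz in offsets]
    return axial, coronal, sagittal


def nearest_neighbor_fill(sparse, shape, step):
    """Expand scores computed every `step` voxels into a dense 2D map by
    copying the score of the nearest evaluated grid position."""
    rows, cols = shape
    last_r = (rows - 1) // step * step
    last_c = (cols - 1) // step * step
    dense = []
    for r in range(rows):
        gr = min(int(r / step + 0.5) * step, last_r)
        dense.append([sparse[(gr, min(int(c / step + 0.5) * step, last_c))]
                      for c in range(cols)])
    return dense


# Toy volume whose voxel value encodes its own coordinates.
volume = [[[100 * z + 10 * y + x for x in range(5)] for y in range(5)]
          for z in range(5)]
axial, coronal, sagittal = extract_25d_patch(volume, 2, 2, 2, half=1)

# Scores evaluated on every second pixel of a 5 x 5 slice.
sparse = {(r, c): r * 10 + c for r in range(0, 5, 2) for c in range(0, 5, 2)}
dense = nearest_neighbor_fill(sparse, (5, 5), step=2)
```

The three planes would be stacked as input channels of one patch; the step size trades
run-time against the resolution of the resulting probability map.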

2.4 R-ConvNet: Deep region classification

We employ the region candidates as inputs. Each superpixel $\in \{S_{RF}\}$ will be observed
at several scales $N_s$ with an increasing amount of surrounding contexts (see Fig. 4).

^1 https://code.google.com/p/cuda-convnet2
Figure 2: The first layer of learned convolutional kernels using three
representations: a) 2.5D sliding-window patches (P-ConvNet), b) CT intensity superpixel
regions ($R_1$-ConvNet), and c) CT intensity + $P_0$ map over superpixel regions
($R_2$-ConvNet).
Figure 3: Axial CT slice of a manual (gold standard) segmentation of the
pancreas. From left to right, there are the ground-truth segmentation contours (in
red), RF based coarse segmentation $\{S_{RF}\}$, and the deep patch labeling result using
P-ConvNet.


Multi-scale contexts are important to disambiguate the complex anatomy in the
abdomen. We explore two approaches: $R_1$-ConvNet only looks at the CT intensity
images extracted from multi-scale superpixel regions, and a stacked $R_2$-ConvNet
integrates an additional channel of patch-level response maps $P_0$ for each region as
input. As a superpixel can have irregular shapes, we warp each region into a regular
square (similar to RCNN Girshick et al. (2014)) as is required by most ConvNet
implementations to date. The ConvNets automatically train their convolutional filter
kernels from the available training data. Examples of trained first-layer convolutional
filters for P-ConvNet, $R_1$-ConvNet, $R_2$-ConvNet are shown in Fig. 2. Deep
ConvNets behave as effective image feature extractors that summarize multi-scale
image regions for classification.
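
The “zoom-out” sampling and the warp to a regular square described in Sec. 2.4 can be
sketched as follows. This is a simplified illustration: the scale factors, the
nearest-neighbor resampling and the function names are assumptions, since the text does
not specify them.

```python
def zoom_out_boxes(box, scales, image_shape):
    """Enlarge a superpixel bounding box (r0, c0, r1, c1; end-exclusive) about
    its centre by each factor in `scales`, clipping to the image."""
    r0, c0, r1, c1 = box
    rows, cols = image_shape
    cr, cc = (r0 + r1) / 2.0, (c0 + c1) / 2.0
    boxes = []
    for s in scales:
        hr, hc = s * (r1 - r0) / 2.0, s * (c1 - c0) / 2.0
        boxes.append((max(int(round(cr - hr)), 0), max(int(round(cc - hc)), 0),
                      min(int(round(cr + hr)), rows), min(int(round(cc + hc)), cols)))
    return boxes


def warp_to_square(image, box, size):
    """Resample the box contents to size x size pixels (nearest neighbour)."""
    r0, c0, r1, c1 = box
    h, w = r1 - r0, c1 - c0
    out = []
    for i in range(size):
        src_r = r0 + min(int((i + 0.5) * h / size), h - 1)
        out.append([image[src_r][c0 + min(int((j + 0.5) * w / size), w - 1)]
                    for j in range(size)])
    return out


# Toy 16 x 16 image whose pixel value encodes its position.
image = [[r * 16 + c for c in range(16)] for r in range(16)]
superpixel_box = (6, 4, 10, 12)                  # 4 rows x 8 columns
boxes = zoom_out_boxes(superpixel_box, [1.0, 1.5, 2.0, 3.0], (16, 16))
patch = warp_to_square(image, boxes[0], size=4)  # the text warps to 64 x 64
```

Each superpixel thus contributes one warped square per scale; for the stacked variant
the same boxes would also be cut from the $P_0$ map and added as a second channel.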
Figure 4: Region classification using R-ConvNet at different scales: a) one-channel
input based on the intensity image only, and b) two-channel input with additional
patch-based P-ConvNet response.


2.5 Data augmentation 


Our ConvNet models ($R_1$-ConvNet, $R_2$-ConvNet) sample the bounding boxes of
each superpixel $\in \{S_{RF}\}$ at different scales $s$. During training, we randomly apply
non-rigid deformations $t$ to generate more data instances. The degree of deformation
is chosen so that the resulting warped images resemble plausible physical variations
of the medical images. This approach is commonly referred to as data augmentation
and can help avoid over-fitting Krizhevsky et al. (2012), Cireşan et al. (2013). Each
non-rigid training deformation $t$ is computed by fitting a thin-plate-spline (TPS)
to a regular grid of 2D control points $\{\omega_i;\ i = 1, 2, \ldots, k\}$. These control points
are randomly transformed within the sampling window and a deformed image is
generated using a radial basis function $\phi(r)$, where
$t(x) = \sum_{i=1}^{k} c_i\,\phi(\|x - \omega_i\|)$ is the
transformed location of $x$ and $\{c_i\}$ is a set of mapping coefficients.
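
As a rough sketch of how such a deformation can be obtained, the code below fits a 2D
thin-plate spline to randomly displaced control points of a regular grid. Several
choices are assumptions and not taken from the text: the kernel $\phi(r) = r^2 \log r$,
the additional affine term of the standard TPS formulation, the 3 × 3 grid, the
displacement range of ±4 pixels and all function names. Only the coordinate mapping is
computed; resampling the image along it is omitted.

```python
import math
import random


def phi(r):
    """Thin-plate radial basis function r^2 log r, with phi(0) = 0."""
    return r * r * math.log(r) if r > 0.0 else 0.0


def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_tps(src, dst):
    """Fit a TPS that maps every control point in `src` onto `dst`.

    Returns a function t(x, y) -> (x', y'). Radial coefficients c_i plus an
    affine part are solved per output coordinate (affine part is an assumption).
    """
    k = len(src)
    system = [[phi(dist(p, q)) for q in src] + [1.0, p[0], p[1]] for p in src]
    system += [[1.0] * k + [0.0] * 3,
               [p[0] for p in src] + [0.0] * 3,
               [p[1] for p in src] + [0.0] * 3]
    coeffs = [solve(system, [d[axis] for d in dst] + [0.0] * 3) for axis in (0, 1)]

    def t(x, y):
        basis = [phi(dist((x, y), q)) for q in src] + [1.0, x, y]
        return tuple(sum(w * b for w, b in zip(c, basis)) for c in coeffs)

    return t


rng = random.Random(0)
grid = [(float(x), float(y)) for y in (0, 32, 64) for x in (0, 32, 64)]
moved = [(x + rng.uniform(-4, 4), y + rng.uniform(-4, 4)) for x, y in grid]
t = fit_tps(grid, moved)
```

Drawing a new set of displacements for every training instance yields a different
plausible warp each time; the displacement range controls the degree of deformation.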


2.6 Cross-scale and 3D probability aggregation 

At testing, we evaluate each superpixel at $N_s$ different scales. The probability scores
for each superpixel being pancreas are averaged across scales:
$p(x) = \frac{1}{N_s} \sum_{s=1}^{N_s} p^{(s)}(x)$, where $p^{(s)}(x)$ is the score at scale $s$.
Then the resulting per-superpixel ConvNet classification values $\{p_1(x)\}$ and $\{p_2(x)\}$
(according to $R_1$-ConvNet and $R_2$-ConvNet, respectively), are directly assigned to
every pixel or voxel residing within any superpixel $\in \{S_{RF}\}$. This process forms two
per-voxel probability maps $P_1(x)$ and $P_2(x)$. Subsequently, we perform 3D Gaussian
filtering in order to average and smooth the ConvNet probability scores across CT
slices and within-slice neighboring regions. 3D isotropic Gaussian filtering can be
applied to any $P_k(x)$ with $k = 0, 1, 2$ to form smoothed $G(P_k(x))$. This is a simple way
to propagate the 2D slice-based probabilities to 3D by taking local 3D neighborhoods
into account. In this paper, we do not work on 3D supervoxels due to computational
efficiency^2 and generality issues. We also explore conditional random fields (CRF)
using an additional ConvNet trained between pairs of neighboring superpixels in
order to detect the pancreas edge (defined by pairs of superpixels having the same
or different object labels). This acts as the boundary term together with the regional
term given by $R_2$-ConvNet in order to perform a min-cut/max-flow segmentation
Boykov and Funka-Lea (2006). Here, the CRF is implemented as a 2D graph with
connections between directly neighboring superpixels. The CRF weighting coefficient
between the boundary and the unary regional term is calibrated by grid-search.
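
The aggregation steps of Sec. 2.6 (averaging over scales, assigning the superpixel score
to its voxels, and 3D Gaussian smoothing) can be sketched as below. Nested lists stand in
for image volumes; the function names, the kernel truncation at three standard
deviations, the border replication and the toy data are assumptions for illustration.

```python
import math


def average_over_scales(scores_per_scale):
    """Average one superpixel's pancreas probabilities over its N_s scales."""
    return sum(scores_per_scale) / len(scores_per_scale)


def paint_superpixels(label_volume, superpixel_prob):
    """Assign each superpixel's probability to all voxels carrying its label.
    Voxels whose label has no score (rejected candidates) get 0."""
    return [[[superpixel_prob.get(lbl, 0.0) for lbl in row] for row in sl]
            for sl in label_volume]


def gaussian_kernel(sigma):
    """Normalised 1D Gaussian kernel truncated at about 3 sigma."""
    radius = int(3 * sigma + 0.5)
    w = [math.exp(-0.5 * (i / sigma) ** 2) for i in range(-radius, radius + 1)]
    total = sum(w)
    return [v / total for v in w]


def smooth_line(values, kernel):
    """1D convolution with border replication."""
    r, n = len(kernel) // 2, len(values)
    return [sum(w * values[min(max(i + k - r, 0), n - 1)]
                for k, w in enumerate(kernel)) for i in range(n)]


def gaussian_smooth_3d(vol, sigma):
    """Separable isotropic 3D Gaussian filter on vol[z][y][x]."""
    kern = gaussian_kernel(sigma)
    nz, ny, nx = len(vol), len(vol[0]), len(vol[0][0])
    out = [[smooth_line(row, kern) for row in sl] for sl in vol]   # along x
    for z in range(nz):                                            # along y
        for x in range(nx):
            col = smooth_line([out[z][y][x] for y in range(ny)], kern)
            for y in range(ny):
                out[z][y][x] = col[y]
    for y in range(ny):                                            # along z
        for x in range(nx):
            line = smooth_line([out[z][y][x] for z in range(nz)], kern)
            for z in range(nz):
                out[z][y][x] = line[z]
    return out


# Toy data: three identical slices with superpixel labels 1..3; label 3 was
# rejected at the candidate stage and therefore has no score.
labels = [[[1, 1, 2], [1, 2, 2], [3, 3, 2]] for _ in range(3)]
probs = {1: average_over_scales([0.9, 0.8, 0.7, 0.8]),
         2: average_over_scales([0.1, 0.2, 0.1, 0.0])}
P = paint_superpixels(labels, probs)
G = gaussian_smooth_3d(P, sigma=1.0)
```

Because the filter is separable, the cost grows linearly with the kernel width, which
keeps this post-processing step cheap compared with ConvNet evaluation.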

3 Results & Discussion 


3.0.1 Data: 

Manual tracings of the pancreas for 82 contrast-enhanced abdominal CT volumes
were provided by an experienced radiologist. Our experiments are conducted using
4-fold cross-validation in a random hard-split of 82 patients for training and testing
folds with 21, 21, 20, and 20 patients for each testing fold. We report both training
and testing segmentation accuracy results. Most previous work Wang et al. (2014),
Chu et al. (2013), Wolz et al. (2013) uses leave-one-patient-out cross-validation
protocols which are computationally expensive (e.g., ~15 hours to process one case
using a powerful workstation Wang et al. (2014)) and may not scale up efficiently
towards larger patient populations. More patients (i.e. 20) per testing fold make
the results more representative for larger population groups.
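
A random hard split of this kind can be sketched as follows. The function name, the
seed and the use of integer patient identifiers are assumptions; the sketch only
illustrates that 82 patients divide into testing folds of 21, 21, 20 and 20, each
patient being tested exactly once.

```python
import random


def make_folds(patient_ids, n_folds, seed):
    """Randomly split patients into `n_folds` disjoint testing folds whose
    sizes differ by at most one."""
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)
    base, extra = divmod(len(ids), n_folds)
    folds, start = [], 0
    for f in range(n_folds):
        size = base + (1 if f < extra else 0)
        folds.append(ids[start:start + size])
        start += size
    return folds


folds = make_folds(range(82), n_folds=4, seed=0)
# For each fold: train on all remaining patients, test on the held-out ones.
splits = [(sorted(set(range(82)) - set(test)), sorted(test)) for test in folds]
```

Splitting by patient rather than by slice or superpixel keeps all data of one patient on
the same side of the split, so testing cases are unseen during training.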


3.0.2 Evaluation: 


The ground truth superpixel labels are derived as described in Sec. 2.1. The
optimally achievable DSC for superpixel classification (if classified perfectly) is 80.5%.
Furthermore, the training data is artificially increased by a factor $N_s \times N_t$ using
the data augmentation approach with both scale and random TPS deformations at
the R-ConvNet level (Sec. 2.5). Here, we train on augmented data using $N_s = 4$,
$N_t = 8$. In testing we use $N_s = 4$ (without deformation based data augmentation)
and $\sigma = 3$ voxels (as 3D Gaussian filtering kernel width) to compute smoothed
probability maps $G(P(x))$. By tuning our implementation of Farag et al. (2014) at a
low operating point, the initial superpixel candidate labeling achieves the average
DSCs of only 26.1% in testing; but has a 97% sensitivity covering all pancreas
voxels. Fig. 5 shows the plots of average DSCs using the proposed ConvNet approaches,
as a function of $P_k(x)$ and $G(P_k(x))$ in both training and testing for one fold of
cross-validation. Simple Gaussian 3D smoothing (Sec. 2.6) markedly improved the
average DSCs in all cases. Maximum average DSCs can be observed at $p_0 = 0.2$,
$p_1 = 0.5$, and $p_2 = 0.6$ in our training evaluation after 3D Gaussian smoothing
for this fold. These calibrated operation points are then fixed and used in testing
cross-validation to obtain the results in Table 1. Utilizing $R_2$-ConvNet (stacked on
P-ConvNet) and Gaussian smoothing ($G(P_2(x))$), we achieve a final average DSC of
71.8% in testing, an improvement of 45.7 percentage points compared to the candidate
region generation stage at 26.1%. $G(P_0(x))$ also performs well with 69.5% mean DSC
and is more efficient since only dense deep patch labeling is needed. Even though the
absolute difference in DSC between $G(P_0(x))$ and $G(P_2(x))$ is small, the
surface-to-surface distance improves significantly from 1.46±1.5mm to 0.94±0.6mm
(p<0.01). An example of pancreas segmentation at this operation point is shown in
Fig. 6. Training of a typical R-ConvNet with $N \times N_s \times N_t \approx 850$k superpixel
examples of size 64 × 64 pixels (after warping) takes ~55 hours for 100 epochs on a
modern GPU (Nvidia GTX Titan-Z). However, execution run-time in testing is in the
order of only 1 to 3 minutes per CT volume, depending on the number of scales $N_s$.
Candidate region generation in Sec. 2.1 consumes another 5 minutes per case.

^2 Supervoxel based regional ConvNets need at least one-order-of-magnitude wider input layers
and thus have significantly more parameters to train.
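
The calibration of the operating points can be illustrated with a short sketch: sweep a
set of probability thresholds on the training cases, keep the one with the highest mean
DSC, and apply it unchanged to a testing case. The threshold grid, the toy probabilities
and the function names are hypothetical.

```python
def dice(pred, truth):
    """Dice Similarity Coefficient of two equally sized binary masks."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    denom = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    return 2.0 * tp / denom if denom else 1.0


def calibrate_threshold(train_cases, thresholds):
    """Pick the threshold with the highest mean DSC over the training cases.
    Each case is (probabilities, truth_mask) with one entry per voxel."""
    def mean_dsc(th):
        scores = [dice([p >= th for p in probs], truth)
                  for probs, truth in train_cases]
        return sum(scores) / len(scores)
    return max(thresholds, key=mean_dsc)


# Toy smoothed probability maps, flattened to a handful of voxels.
train_cases = [
    ([0.9, 0.7, 0.55, 0.3, 0.1], [1, 1, 1, 0, 0]),
    ([0.8, 0.65, 0.35, 0.2, 0.05], [1, 1, 0, 0, 0]),
]
thresholds = [i / 10 for i in range(1, 10)]
best = calibrate_threshold(train_cases, thresholds)

# The calibrated threshold is fixed and then applied to an unseen case.
test_probs, test_truth = [0.85, 0.45, 0.42, 0.1], [1, 1, 0, 0]
test_dsc = dice([p >= best for p in test_probs], test_truth)
```

Fixing the threshold on training data before testing avoids tuning on the testing
cases, which would otherwise bias the reported DSC upwards.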

To the best of our knowledge, this work reports the highest average DSC with
71.8% in testing. Note that a direct comparison to previous methods is not possible
due to lack of publicly available benchmark datasets. We will share our data and
code implementation for future comparisons^3,^4. Previous state-of-the-art results are
at ~68% to ~69% Wang et al. (2014), Chu et al. (2013), Wolz et al. (2013), Farag
et al. (2014). In particular, DSC drops from 68% (150 patients) to 58% (50 patients)
under the leave-one-out protocol Wolz et al. (2013). Our results are based on a 4-fold
cross-validation. The performance degrades gracefully from training (83.6±6.3%) to
testing (71.8±10.7%) which demonstrates the good generality of learned deep
ConvNets on unseen data. This difference is expected to diminish with more annotated
datasets. Our methods also perform with better stability (i.e., comparing 10.7%
versus 18.6% Wang et al. (2014), 15.3% Chu et al. (2013) in the standard
deviation of DSCs). Our maximum test performance is 86.9% DSC with 10%, 30%, 50%,
70%, 80%, and 90% of cases being above 81.4%, 77.6%, 74.2%, 69.4%, 65.2% and
58.9%, respectively. Only 2 outlier cases lie below 40% DSC (mainly caused by
over-segmentation into other organs). The remaining 80 testing cases are all above 50%.


^3 http://www.cc.nih.gov/about/SeniorStaff/ronald_summers.html
^4 http://www.holgerroth.com/
Figure 5: Average DSCs as a function of un-smoothed $P_k(x)$, $k = 0, 1, 2$, and 3D
smoothed $G(P_k(x))$, $k = 0, 1, 2$, ConvNet probability maps in training (left) and
testing (right) in one cross-validation fold.


The minimal DSC value of these outliers is 25.0% for $G(P_2(x))$. However, Wang
et al. (2014), Chu et al. (2013), Wolz et al. (2013), Farag et al. (2014) all report
gross segmentation failure cases with DSC even below 10%. Lastly, the variation
$CRF(P_2(x))$ of enforcing $P_2(x)$ within a structured prediction CRF model achieves
only 68.2%±4.1%. This is probably due to the already high quality of $G(P_0)$ and
$G(P_2)$ in comparison.


Table 1: 4-fold cross-validation; optimally achievable DSCs, our initial candidate
region labeling using $S_{RF}$, DSCs on $P(x)$ and using smoothed $G(P(x))$, and a CRF
model for structured prediction.

DSC (%)   Opt.   S_RF(x)   P_0(x)   G(P_0(x))   P_1(x)   G(P_1(x))   P_2(x)   G(P_2(x))   CRF(P_2(x))
Mean      80.5   26.1      60.9     69.5        56.8     62.9        64.9     71.8        68.2
Std        3.6    7.1      10.4      9.3        11.4     16.1         8.1     10.7         4.1
Min       70.9   14.2      22.9     35.3         1.3      0.0        33.1     25.0        59.6
Max       85.9   45.8      80.1     84.4        77.4     87.3        77.9     86.9        74.2


DeepOrgan: Multi-level Deep Convolutional Networks for Automated Pancreas Segmentation
H. R. Roth et al.



Figure 6: Example of pancreas segmentation using the proposed $R_2$-ConvNet
approach in testing. a) The manual ground truth annotation (in red outline); b) the
$G(P_2(x))$ probability map; c) the final segmentation (in green outline) at $p_2 = 0.6$
(DSC=82.7%).


4 Conclusion 


We present a bottom-up, coarse-to-fine approach for pancreas segmentation in
abdominal CT scans. Multi-level deep ConvNets are employed on both image patches
and regions. We achieve the highest reported DSCs of 71.8±10.7% in testing and
83.6±6.3% in training, at the computational cost of a few minutes, not hours as in
Wang et al. (2014), Chu et al. (2013), Wolz et al. (2013). The proposed approach
can be incorporated into multi-organ segmentation frameworks by specifying more
tissue types since ConvNet naturally supports multi-class classifications Krizhevsky
et al. (2012). Our deep learning based organ segmentation approach could be
generalizable to other segmentation problems with large variations and pathologies, e.g.
tumors.


Acknowledgments: This work was supported by the Intramural Research
Program of the NIH Clinical Center. The final publication will be available at Springer.